-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
65 reweighting Scottish EPC records #90
base: dev
Are you sure you want to change the base?
Conversation
… for reweighting remove unused functions from get_target.py
… reweighting for tenure and property type
…2 features and England and Wales with 3
4 Scottish DataZones have 0 values for all property types and all tenure types. These data cannot be used for reweighting
…if sample reduced to len 0
… information for dummy rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Roisín,
Great piece of coding! 👍
I can imagine it must have been mildly annoying to create two separate sets of getter functions for EW and Scottish data, but you've done a good job. Thank you for clearly documenting your expectations in the PR—those guidelines were really helpful to follow! 👍
Minor Suggestions:
- Config File: A few tweaks to improve clarity.
- Documentation: Some additional details could make the code easier to understand for others.
- Assertions: I’ve added a couple of suggestions.
Output Review:
I checked the output files and performed a sanity check on some of the results. Everything looks fine for both datasets! However, I noticed we lost a total of 23,801 rows, which averages out to about 0.5 rows per LSOA. This doesn’t seem significant but might be worth exploring further to understand why we're losing those rows. That said, it's more of a second/third-order concern at this stage.
Performance:
The processing times range from 0.05 seconds to 6.9 seconds, which seems reasonable. It's expected that some LSOAs take longer due to differences in their distribution.
- Reweighting has been applied correctly to each nation in the code. I.e. Scotland is reweighted appropriately on property_type and tenure features only, whereas England and Wales should also be reweighted on build_year Checked this, seems to be good.
- Joins to make sure they are correct and we are not accidentally losing data we should keep. My intention is to retain all the rows that get assigned a weight, but not to keep the unweighted ones. The idea is to join this dataset back onto the full EPC dataset in a separate processing script. I double checked the joins and the logic, it checks out
- The preprocessing steps in run_compute_epc_weights.py before reweighting occurs to ensure they are applied in the appropriate order before and after filtering for country Preprocessing steps seem to be in the correct order.
- Feel free to suggest places where we can add test/assertions to ensure the pipeline is executing the expected behaviours. I'd like to look at this properly myself later and add some in a separate PR. I've added some suggestions!
Hope this is helpful!
asf_heat_pump_suitability/pipeline/run_scripts/run_compute_epc_weights.py
Show resolved
Hide resolved
"BUILT_FORM", | ||
"CONSTRUCTION_AGE_BAND", | ||
], | ||
) | ||
|
||
# Join ONSPD LSOA col | ||
epc_df = output_areas.standardise_col_postcode(epc_df, pcd_col="POSTCODE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe after standardisation you do a regex check to see if the postcode is in the correct format:
valid_postcode_mask = epc_df["POSTCODE"].str.contains(r"^[A-Z0-9 ]+$")
assert valid_postcode_mask.all(), "Some standardised postcodes contain invalid characters."
@@ -102,66 +110,79 @@ def parse_arguments() -> argparse.Namespace: | |||
# 1. Add standardised weighting feature columns to EPC and drop rows missing data required for reweighting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially something like:
assert (epc_df["lsoa"].is_null().sum() / epc_df.shape[0]) < 0.05, "More than 5% of rows missing lsoa after ONSPD join, which is unexpected."
logging.info( | ||
f"Running reweighting for {key}. Reweighting using the following features: {features}" | ||
) | ||
epc_cleaned_df = epc_df.filter(pl.col("COUNTRY") == key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert epc_cleaned_df.shape[0] > 0, f"No rows found for {key}, expected some data."
weights["UPRN"].extend(_weights["UPRN"]) | ||
# Adding LSOA required for dummy rows | ||
weights["lsoa"].extend([lsoa for i in range(len(_weights["UPRN"]))]) | ||
weights["weight"].extend(_weights["weight"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question more so for you, but maybe we can make an assert statement here for the weights to make sure that they all add up to 1 overall (if that makes sense?)
Fixes #65 and #89
Description
Add reweighting of Scottish EPC records into pipeline. Summary of changes:
config/base.yaml
- add new Scottish census datasets for reweightingget_target.py
- add new getters to get Scottish target data for reweighting. (for property type and tenure but not build year)prepare_target.py
- use new getter to prepare target dataset for reweightingprepare_sample.py
- changes to shorten category names in property type featurerun_compute_epc_weights.py
- update run script to reweight Scotland with 2 features and England and Wales with 3reweight_epc.py
- add TODO noteNB: information about data sources will be added to the README in a separate PR.
Instructions for Reviewer
To run code:
python -i asf_heat_pump_suitability/pipeline/run_scripts/run_compute_epc_weights.py --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet -y 2023 -q 4
NB: The whole pipeline will take a few hours to run so please could you just test to make sure it works for Scotland and then you can interrupt the run.
Here are the latest outputs of the pipeline which you can load and check instead of running it all through:
Weights: s3://asf-heat-pump-suitability/outputs/2023Q4/20241129_2023_Q4_EPC_weights.parquet
Stats: s3://asf-heat-pump-suitability/outputs/2023Q4/20241129_2023_Q4_EPC_weights_stats.parquet
Please pay special attention to ...
run_compute_epc_weights.py
before reweighting occurs to ensure they are applied in the appropriate order before and after filtering for countryChecklist:
notebooks/
pre-commit
and addressed any issues not automatically fixeddev
README
s